Generalized Additive Models in Fraud Detection

Data Science Capstone Project

Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou

2025-11-21

Introduction

What are generalized additive models?

  • Not your typical straight-line regression — GAMs let patterns curve naturally

  • Great at uncovering hidden trends in messy real-world data

  • Each feature gets its own shape, showing where risk rises or falls

  • Makes the model’s behavior easy to explain to non-technical teams

  • Perfect for fraud detection, where small pattern changes matter

Brief History of GAMs

Generalized Additive Models were introduced in the late 1980s as a way to add flexibility to traditional regression models. Trevor Hastie and Robert Tibshirani developed the framework to allow each predictor in a model to follow its own smooth pattern rather than forcing everything into a straight line. Through the 1990s and early 2000s, the approach grew in popularity in fields that needed interpretable models, including public health, ecology, and social sciences.

A major step forward came with the development of the mgcv package in R, created by Simon Wood. His work added modern smoothing techniques, automatic penalty selection, and faster computation, making GAMs practical for large and noisy datasets. Today, GAMs are widely used in finance, fraud detection, risk scoring, and other areas where organizations need both predictive accuracy and clear explanations.

GAMs in Action: Real-World Uses + Our Study

GAMs help uncover nonlinear relationships and subtle patterns across diverse domains:

  • Financial Analytics: Detecting anomalies and potential fraud in transaction data

  • Banking & Insurance: Modeling risk scores

  • Environmental Science: Forecasting environmental and climate trends

  • Public Health: Understanding health outcomes and population-level patterns

Our Project: GAMs for Fraud Detection

  • Toolset: RStudio + mgcv package

  • Dataset: Kaggle’s Fraud Detection Transactions (Ashar, 2024)

  • Purpose: Identify predictive variables linked to fraudulent activity

  • Context: Synthetic but realistic data for controlled testing

Here’s how we used GAMs to explore patterns in the fraud dataset.

Methods

GAM Modeling Overview

  • GAMs extend traditional regression

  • Capture nonlinear predictor-response relationships

  • Use spline-based smooth functions

  • Combine continuous + categorical predictors

  • Fit with mgcv (penalized splines + GCV)

  • Model outputs interpretable smooth effects

  • Goal: Estimate probability of fraud
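
As a rough sketch of what "penalized splines" means here, each smooth can be written as a weighted sum of basis functions, with the weights estimated by penalized likelihood. The symbols \(b_{jk}\), \(\beta_{jk}\), \(\lambda_j\), and \(S_j\) are standard notation introduced for illustration, not taken from the project:

\[ s_j(X_j) = \sum_{k=1}^{K_j} \beta_{jk} \, b_{jk}(X_j) \]

\[ \hat{\beta} = \arg\min_{\beta} \left\{ -\ell(\beta) + \frac{1}{2} \sum_{j=1}^{p} \lambda_j \, \beta_j^{\top} S_j \, \beta_j \right\} \]

Here \(\ell(\beta)\) is the binomial log-likelihood and \(\beta_j^{\top} S_j \beta_j\) measures how wiggly \(s_j\) is. A larger \(\lambda_j\) gives a smoother curve; mgcv selects each \(\lambda_j\) from the data using a criterion such as GCV.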

Modeling Workflow Steps

1. Data Acquisition

2. Data Exploration & Cleaning

3. Categorical Summary

4. Visualizations

5. Assumptions

6. GAM Analysis

7. GAM Model for predictors

8. Model Performance

9. Final Interpretation

GAM Equation

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

  • g = link function (logit for the binary fraud outcome)

  • Smooth functions capture nonlinear effects

  • Additive contributions from each predictor

  • Balances flexibility + interpretability
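
To make the additive structure concrete, here is a minimal Python sketch (the project itself was fit in R). The intercept and the two smooth functions `s_amount` and `s_hour` are hypothetical shapes chosen only for illustration; they are not estimates from our model.

```python
import math

def inverse_logit(eta):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical smooth effects (illustration only, not fitted values).
def s_amount(amount):
    # Rises quickly for small amounts, then flattens out.
    return 0.8 * math.log1p(amount / 100.0) - 1.0

def s_hour(hour):
    # Periodic effect over the day, highest at midnight.
    return 0.5 * math.cos(2.0 * math.pi * hour / 24.0)

ALPHA = -2.0  # hypothetical intercept on the log-odds scale

def fraud_probability(amount, hour):
    # Additive predictor: intercept plus one contribution per feature.
    eta = ALPHA + s_amount(amount) + s_hour(hour)
    return inverse_logit(eta)

p_night = fraud_probability(amount=500.0, hour=0)
p_noon = fraud_probability(amount=500.0, hour=12)
print(f"night: {p_night:.3f}, noon: {p_noon:.3f}")
```

Because the contributions simply add on the log-odds scale, each feature's curve can be plotted and read on its own, which is where the interpretability comes from.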

GAM Assumptions (Fraud Context)

  • Logit link approximates fraud probability

  • Additive and independent predictor effects

  • Smooth, gradual functional relationships

  • Binomial response distribution

  • Independent observations

  • Low predictor multicollinearity

  • Penalization prevents overfitting
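
One simple way to screen for multicollinearity is a pairwise Pearson correlation between numeric predictors. The Python sketch below uses made-up values for two hypothetical predictors, `amount` and `balance`. For a fitted GAM in R, mgcv's `concurvity()` function is the nonlinear analogue of this check.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy values invented for illustration; not taken from the dataset.
amount = [10, 20, 30, 40, 50]
balance = [5, 9, 16, 19, 26]

r = pearson(amount, balance)
print(f"correlation = {r:.3f}")
if abs(r) > 0.9:
    print("Predictors are nearly collinear; consider dropping one.")
```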

Why We Chose GAMs For Fraud Detection

  • Captures nonlinear fraud patterns

  • Handles rare, imbalanced outcomes

  • Produces interpretable smooth risk curves

  • Supports regulatory transparency

  • Balances accuracy + interpretability

  • Strong literature support for fraud analytics

  • Scalable through mgcv’s automated smoothing

Practical Advantages & Relevance to Real-World Analytics

  • Supports investigative decision-making

  • Shows monotonic or nonlinear risk curves

  • Can benchmark or surrogate black-box models

  • High recall for suspicious transactions

  • Useful for auditors, fraud teams, analysts

  • Aligns with both operational and compliance needs
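
Recall is the share of actual fraud cases that the model flags. A minimal Python sketch of the calculation follows; the labels and predictions are toy values invented for illustration, not model output.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 = fraud."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

# Toy data invented for illustration.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
recall = tp / (tp + fn)      # fraud cases caught / all fraud cases
precision = tp / (tp + fp)   # true alerts / all alerts raised
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

A fraud team that cares most about missed fraud would tune the probability threshold toward higher recall and accept more false alerts in exchange.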

Analysis and Results

Data Exploration and Visualization

Dataset Description

What It Is

  • A synthetic dataset built to mimic real financial transactions

  • Privacy‑safe: no real people’s data used

  • Hosted on Kaggle

Why We Use It

  • Train fraud detection models for binary classification tasks

  • Spot fraud: each transaction labeled as fraud (1) or not fraud (0)

What Makes It Special

Realistic fraud patterns:

  • Groups of fraudulent transactions

  • Subtle, hard‑to‑notice anomalies

  • Odd user behaviors

  • Large and diverse records: mixes normal transactions with rarer fraud cases, so models must handle class imbalance

Key Characteristics

What’s Inside

  • 50,000 Rows: A good amount of data to work with.

  • Two Labels: Every transaction is marked as either 1 = Fraud or 0 = Not Fraud

Data Features: 21 features across three categories

  • Numbers: Like transaction amounts, risk scores, account balances.

  • Categories: Transaction types (payment, transfer, withdrawal), device types, merchant categories.

  • Time Data: When transactions happened (time, day) and their sequence.

Label Distribution and Class Imbalance

  • Fraudulent transactions are a small percentage, reflecting real-world scenarios.

  • Behavioral Realism: Includes unusual spending, behavioral signals, and high-risk profiles.

  • Modeling flexibility: supports interpretable (GAMs, logistic regression) or high-performance (XGBoost) approaches
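
Class imbalance matters because plain accuracy can look good while no fraud is caught. A minimal Python sketch of that effect follows; the label vector is a toy example, not the dataset's actual fraud rate.

```python
# Toy labels invented for illustration: 1 = fraud, 0 = not fraud.
labels = [0] * 18 + [1] * 2

fraud_rate = sum(labels) / len(labels)

# A trivial "model" that never flags fraud.
always_not_fraud = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, always_not_fraud) if y == p)
accuracy = correct / len(labels)

caught = sum(1 for y, p in zip(labels, always_not_fraud) if y == 1 and p == 1)
recall = caught / sum(labels)

print(f"fraud rate = {fraud_rate:.2f}")
print(f"accuracy   = {accuracy:.2f}")
print(f"recall     = {recall:.2f}")
```

The trivial model scores high on accuracy yet finds no fraud at all, so accuracy alone is a poor yardstick on imbalanced labels.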

Distribution of Variables

Table 1 – Transaction Types and Counts

Type              Count
POS               12,549
Online            12,546
ATM Withdrawal    12,453
Bank Transfer     12,452

Table 2 – Device Types and Counts

Device            Count
Tablet            16,779
Mobile            16,640
Laptop            16,581

Table 3 – Merchant Categories and Counts

Merchant Category Count
Clothing          10,033
Groceries         10,019
Travel            10,015
Restaurants        9,976
Electronics        9,957
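
The counts in Tables 1 to 3 can be checked against the 50,000-row total and turned into shares with a few lines of Python. The counts are copied from the tables; the variable and function names are illustrative.

```python
# Counts copied from Tables 1-3.
transaction_types = {
    "POS": 12549, "Online": 12546,
    "ATM Withdrawal": 12453, "Bank Transfer": 12452,
}
device_types = {"Tablet": 16779, "Mobile": 16640, "Laptop": 16581}
merchant_categories = {
    "Clothing": 10033, "Groceries": 10019, "Travel": 10015,
    "Restaurants": 9976, "Electronics": 9957,
}

def shares(counts):
    """Convert raw counts to proportions of the table total."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}

tables = {
    "Transaction type": transaction_types,
    "Device type": device_types,
    "Merchant category": merchant_categories,
}
for name, counts in tables.items():
    total = sum(counts.values())
    spread = max(shares(counts).values()) - min(shares(counts).values())
    print(f"{name}: total = {total}, share spread = {spread:.4f}")
```

Each table sums to 50,000, and the levels within each variable are almost evenly represented.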

Non-linearity Check

Modeling and Results

Assumptions

GAM Analysis for Numeric Variables

GAM Analysis for Categorical Variables

GAM Model for Key Predictor

GAM Equation for Key Predictor

GAM equation structure:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

Our model simplifies to a single predictor:

\[ \text{logit}(\Pr(\text{Fraud} = 1)) = \alpha + s(\text{Risk\_Score}) \]

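where α = 1.9109 is the intercept on the log-odds scale. Because mgcv centers each smooth term to sum to zero over the data, the intercept gives the log-odds of fraud when s(Risk_Score) contributes zero, that is, at the average smooth effect, rather than at Risk_Score = 0.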
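
The intercept is reported in log-odds, so the inverse logit is needed to read it as a probability. A minimal Python sketch of that arithmetic follows, using the reported value of 1.9109 and assuming the smooth term contributes zero. The variable names are illustrative.

```python
import math

alpha = 1.9109  # reported intercept (log-odds scale)

# Odds = exp(log-odds); probability = odds / (1 + odds).
baseline_odds = math.exp(alpha)
baseline_probability = 1.0 / (1.0 + math.exp(-alpha))

print(f"baseline odds        = {baseline_odds:.2f}")
print(f"baseline probability = {baseline_probability:.3f}")
```

For any other Risk_Score, the fitted value of s(Risk_Score) is added to alpha before the same inverse-logit transformation.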